In [1]:
# Importing the required libraries
import pandas as pd
import pandas_profiling as pp

2. Exploratory Data Analysis (EDA)

2.1 Loading data

In [2]:
# Reading the data
data = pd.read_csv("/users/abhishekkumar/downloads/vehicles.csv")

2.2 Profiling Data and initial EDA

In [3]:
#Data profiling 
pp.ProfileReport(data)
Out[3]:

2.2.1 Initial analysis

  1. "passengers" has more than 50% missing values, so the column will be dropped.
  2. "vin" and "carfax_url" have significant missing values; "carfax_url" will be dropped, while "vin" is kept for de-duplication.
  3. "engine" and "description" have missing values that will be filled with placeholder text, since dropping those rows would lose a significant number of data points.
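
The missing-value shares quoted above come from the profiling report, but they can also be checked directly in pandas. The following is a minimal sketch on a made-up toy frame; the values and the 50% cutoff are illustrative assumptions, not figures taken from vehicles.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for vehicles.csv; all values are made up.
toy = pd.DataFrame({
    "price": [5000, 7500, 3000, 12000],
    "passengers": [np.nan, np.nan, np.nan, 5.0],
    "engine": ["2.0L", None, "1.6L", "3.5L"],
})

# Share of missing values per column (mean of the boolean NaN mask).
missing_share = toy.isna().mean()

# Columns above an assumed 50% cutoff are candidates for dropping.
drop_candidates = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(drop_candidates)
```

On the real data, the same two lines applied to `data` would list the columns that cross whatever cutoff is chosen.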

As expected from the data:

  1. Price and mileage are negatively correlated.
  2. Year and mileage are negatively correlated.
  3. Year and price are positively correlated.
  4. Is Private and mileage show some positive correlation, which needs to be investigated.
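
The signs listed above can be read off a correlation matrix. Here is a small sketch using `DataFrame.corr()` on invented numbers; the column names `year`, `mileage`, and `price` are assumed to match the dataset, and the values are chosen only so that the expected pattern is visible.

```python
import pandas as pd

# Invented sample: newer cars have lower mileage and higher prices.
toy = pd.DataFrame({
    "year": [2008, 2012, 2015, 2018, 2020],
    "mileage": [180000, 120000, 90000, 40000, 15000],
    "price": [3000, 6500, 9000, 15000, 21000],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr.round(2))
```

A negative entry for price/mileage and year/mileage, and a positive one for year/price, would match the observations from the profiling report.
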
In [5]:
# Handling missing values: work on a copy so the original data stays unchanged
updata = data.copy()
In [18]:
# Data cleaning
updata = data.drop(["carfax_url", "passengers"], axis=1)
# Dropping duplicate listings by vin (rows with a missing vin count as duplicates of each other)
updata = updata.drop_duplicates(subset=["vin"])
# Filling missing text fields with placeholders
updata["engine"] = updata["engine"].fillna("Value not available")
updata["description"] = updata["description"].fillna("Description not available")
updata.to_csv('/users/abhishekkumar/downloads/vehicle_data_file.csv')
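
One caveat about de-duplicating on "vin": pandas treats missing values as equal to each other when looking for duplicates, so every row with a missing vin except the first is removed. The sketch below shows this on a made-up frame and one possible alternative that keeps rows whose vin is missing; it is an illustration under assumed data, not necessarily the behaviour wanted for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up listings: one repeated vin and two rows with a missing vin.
toy = pd.DataFrame({
    "vin": ["A1", "A1", np.nan, np.nan, "B2"],
    "price": [5000, 5000, 7000, 8000, 9000],
})

# Plain de-duplication: NaN values compare as equal here, so only the
# first row with a missing vin survives.
naive = toy.drop_duplicates(subset=["vin"])

# Alternative: drop a row only if its vin is present AND already seen.
is_repeat = toy["vin"].notna() & toy.duplicated(subset=["vin"])
careful = toy[~is_repeat]

print(len(naive), len(careful))
```

Which variant is appropriate depends on whether listings without a vin should be treated as distinct vehicles.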